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Abstract 

Background: Diagnosing subclinical atherosclerosis is often difficult since patients are asymptomatic. In order to 
alleviate this limitation, we have developed a molecular prediction technique for predicting patients with 
atherogenic risks using multi-gene expression biomarkers on leukocytes. 

Methods: We first discovered 356 expression biomarkers which showed significant differential expression between 
genome-wide microarray data of monocytes from patients with familial hyperlipidemia and increased risk of 
atherosclerosis compared to normal controls. These biomarkers were further triaged with 56 biomarkers known to 
be directly related to atherogenic risks. We also applied a COXEN algorithm to identify concordantly expressed 
biomarkers between monocytes and each of three different cell types of leukocytes. We then developed a multi- 
gene predictor using all or three subsets of these 56 biomarkers on the monocyte patient data. These predictors 
were then applied to multiple independent patient sets from three cell types of leukocytes (macrophages, 
circulating T cells, or whole white blood cells) to predict patients with atherogenic risks. 

Results: When the 56 predictor was applied to the three patient sets from different cell types of leukocytes, all 
significantly stratified patients with atherogenic risks from healthy people in these independent cohorts. 
Concordantly expressed biomarkers identified by the COXEN algorithm provided slightly better prediction results. 

Conclusion: These results demonstrated the potential of molecular prediction of atherogenic risks across different 
cell types of leukocytes. 



Background 

Atherosclerosis is the leading cause of death in most 
countries [1]. In the United States alone, it is estimated 
that more than 16 million people have atherosclerotic 
heart disease resulting in 870, 000 deaths in 2005. Diag- 
nosis of subclinical atherosclerosis is difficult since indi- 
viduals are still asymptomatic. Early detection of 
atherosclerosis may help to prevent complications of the 
disease or slow its progression [2]. Atherosclerosis risk 
can first be indirectly predicted by clinical disease risk 
factors such as diabetes, LDL, HDL, or triglyceride. The 
progression of atherosclerosis, however, is a complex 
multifactorial process which involves multiple molecular 
mechanisms such as lipid deposition in arteries, smooth 
muscle cell proliferation, thrombogenesis, and platelet 
aggregation so it has been found to be difficult to 
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predict atherosclerosis by these standard risk factors 
alone. For example, Tertov and his collaborators [3] 
found there was a lack of correlation between the degree 
of human plasma low density lipoprotein oxidation and 
its atherogenic potential (r = 0.10, n = 127) and LDL 
levels showed a weak correlation with a coefficient of 
0.12 for total atheroma volume. 

Tests based on surgical patient samples such as arter- 
ial tissue or atherosclerotic plaques are used to diagnose 
atherosclerosis but are impractical [4,5]. High through- 
put molecular techniques have also been used to better 
understand the pathogenesis and progression of athero- 
sclerosis but have not been fully utilized for diagnostics 
[4,5]. Invasive or non-invasive imaging methods for pla- 
que in human coronary arteries are currently used to 
detect the more exact status of atherosclerosis in 
patients [6,7]. However, use of these imaging methods 
for plaque detection in human coronary arteries is costly 
and associated with the risk of adverse events so are 
often restricted to a small proportion of patients who 
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already show high-risk features and clinical symptoms 
[7]. Consequently, despite these biotechnical and mole- 
cular efforts, diagnosis of subclinical atherosclerosis 
remains difficult. 

Identifying genome-wide biomarkers of various mole- 
cular mechanisms relevant to atherogenic risks, we here 
developed multi-gene prediction signatures for stratify- 
ing patients with atherosclerosis using peripheral blood 
samples. In particular, we addressed two questions in 
this study: 1) whether subclinical atherosclerosis can be 
diagnosed by molecular biomarkers in peripheral blood 
samples, and 2) whether they share certain common 
molecular expression signatures across different cell 
types of leukocytes. 

Methods 

Patient and Microarray Data Sets 

Four published microarray sets (Table 1) of different 
leukocyte subsets from atherogenic patients and healthy 
controls were used to construct and validate our multi- 
gene predictors. The first set of patients with familial 
hypercholesterolemia (FH) with microarray data of 
monocytes, FHl (NCBI GEO GSE6054), was used as 
our training set [8]. Familial hypercholesterolemia is a 
well-known genetic disease caused by a mutation (1 out 
of 500 people) in the genetic encoding for the LDL 
receptor (LDLR) gene [9]. This mutation results in dra- 
matically elevated risks of atherosclerosis with most 
patients experiencing atherogenic complications in their 
early age. FH patients have thus been frequently fol- 
lowed and investigated for atherosclerosis risks and dis- 
ease mechanisms. In particular, the risk of 
atherosclerosis among homozygous FH patients with 
two abnormal copies of the LDLR gene is increased 5- 
to 10-fold from that of patients with one abnormal copy 
or heterozygous FH patients. This set consists of 23 sub- 
jects: 3 homozygous FH patients, 7 heterozygous FH 
patients, and 13 healthy controls. 

Three additional data sets of patients were obtained 
and used as test sets for our multi-gene predictors. The 
first test set, FH2 (NCBI GEO GSE6088), was from the 



same FH patients in the training set but with microarray 
data of different cell, T-lymphocytes [8]. The second test 
set, FH3 (NCBI GEO GSE13985), is a completely inde- 
pendent set of 10 FH patients. In this study, whole 
white blood cells from five FH patients and five controls 
matched in age, sex, BMI, and smoking status were used 
for gene expression profiling. All patient samples in 
FHl, FH2, and FH3 sets were profiled with Affymetrix 
HG-U133 plus 2.0 GeneChip microarrays. 

The third test set, ATHEROl (NCBI GEO GSE9874), 
was comprised of 15 patients with asymptomatic athero- 
sclerosis and a family history of coronary heart disease 
and 15 age- and sex-matched subjects with no athero- 
sclerosis and no family history of coronary heart disease 
as control [10]. Two arrays were excluded after our 
initial quality control analysis. Therefore a total of 28 
samples was chosen for the model test. The microarrays 
were obtained from monocyte-derived macrophage cells 
of these patients' peripheral blood samples with Affyme- 
trix HG-U133A GeneChips. Affymetrix HG-U133A is a 
part of Affymetrix HG-U133 plus 2.0 so we chose the 
common probe sets for our model construction and 
data analysis. 

Development of Multi-Gene Predictors of Atherosclerosis 

Our model training of atherosclerosis predictors was 
composed of three sequential analysis steps (Figure 1). 
The first was the identification of significant differen- 
tially expressed genes by comparing FH patients and 
normal controls in the training set by a two sample t- 
test. In order to avoid the multiple comparisons pitfall 
in the high throughput biomarker discovery, we adjusted 
the statistical significance by False Discovery Rate (FDR) 
q-value < 0.01 [11]. 

In the second step, we identified a subset of the above 
biomarkers known to have functions directly related to 
atherosclerosis and cardiovascular diseases. Biological 
functional analysis on these genes was performed by the 
Ingenuity Pathway Analysis (IPA; http://www.ingenuity. 
com; Redwood City, CA). Genes related to inflamma- 
tion, lipid metabolism process and another metabolic 



Table 1 Microarray datasets were used for atherosclerosis prediction model derivation and evaluation. 



Data Set 



Human Blood 
Cell 



Function GEO ID 



Disease information 



Familial Hypercholesterolemia 

Familial Hypercholesterolemia 

Familial Hypercholesterolemia 

Patient with subclinical atherosclerosis and a family history of coronary heart 

disease 



Samples 



FH1 Monocyte 

FH2 Circulating T cells 

FH3 White Blood cells 

ATHER01 Macrophage 



Training GSE6054 

Test GSE6088 

Test GSE 13985 

Test GSE9874 



23 
23 
10 
28 



Four previously-published microarray sets in the NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo) from different human blood cells were used to construct 
and validate our prediction model. Data set FH1 was chosen and used for our multi-gene model development. The other three sets — FH2, FH3, and ATHER01 — 
were used for evaluation of the atherosclerosis prediction model. 
GEO web address: http://www.ncbi.nlm.nih.gov/geo/ 
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Figure 1 Schematic diagram of the atherosclerosis prediction 
model construction and validation processes. Our prediction 
model was developed by four distinct steps - 1) the identification of 
significant genes by t-test analysis of the training set FH1, 2) 
selected biomarkers known to be directly related to atherosclerosis 
performed by Ingenuity Pathway Analysis, 3) (optional) selection of 
COXEN biomarkers among the atherosclerosis-related gene 
biomarkers obtained from the third step which led to the 48, 6, and 
26 biomarkers for test set 1 (circulating T-cell), test set 2 (white 
blood cell) and test set 3 (macrophage), 4) the multivariate 
prediction model construction based on universal or COXEN 
predictors. 



process (including carbohydrate protein and nucleic acid 
metabolism) and hematological system developments 
were selected [12,13]. 

The last step (Step 3) was the multivariate prediction 
model construction. Microarray expression values were 
first standardized within each gene for each set in order 
to adjust for biased expression distributions of the four 
independent data sets obtained under different experi- 
mental settings. We then applied principal component 
analysis (PCA) to reduce the high dimension of the 
multi-gene biomarker space. Top three PCAs were cho- 
sen to convey the majority (> 70%) of the information 
of original genes while their loadings (or weights) reflect 
the influence of the original genes. 

We trained our prediction model using the linear dis- 
criminant analysis (LDA) technique, a widely-used mul- 
tivariate classification technique for our prediction 
modeling. First, 23 human samples in the training set 
were divided into a low-risk group (normal controls) 
and a high-risk group (patients with familial hypercho- 
lesterolemia). The LDA is then performed to obtain a 
linear discriminant function for the two risk classes by 
maximizing the prediction power or minimizing the 



overall classification error both for false positives and 
false negatives with equal weights. The output of the 
prediction model is the LDA posterior probabilities of 
atherosclerosis between 0 (low-risk) and 1 (high-risk). 
Thus, a high predicted score (or posterior probability) of 
a subject implies a high risk of atherosclerosis whereas a 
low predicted score suggests a low risk of 
atherosclerosis. 

The performance of prediction models was evaluated 
to examine if the predicted (posterior) probability of 
atherosclerosis risk could statistically discriminate 
patients at high risk for atherosclerosis from normal 
controls in these three test sets. The statistical signifi- 
cance of the difference in scores was first evaluated 
using the two-sample t-test between high risk and con- 
trol groups. To define an optimal cutoff value for each 
predictor, a Receiver Operating Characteristic (ROC) 
curve was plotted for the predicted scores of athero- 
sclerosis risk and phenotype information on each set. 
The cutoff value on each ROC curve was then deter- 
mined by maximizing the Youden index (= sensitivity 
+specificity-l) [14]. Sensitivity, specificity, positive pre- 
dictive value, (PPV), and negative predictive value 
(NPV) were then calculated to evaluate its prediction 
performance at the optimal cutoff values. 

Sensitivity, specificity, PPV, NPV were defined as 



Sensitivity = 



Specificity = 



number of true positives 



number of true positives + number of false negatives 



number of true negatives 



number of true negatives + number of false positives 



PPV -- 



NPV - 



number of true positives 



number of true positives + number of false positives 



number of true negatives 



number of true negatives + number of false negatives 



We also applied the COXEN algorithm, a previously 
published bioinformatics technique that can identify 
concordantly (expression) regulated genes between dif- 
ferent cancer types [15,16] to improve our prediction on 
patient sets from different cell types of leukocytes. 
White blood cells (WBC) are a heterogeneous mixture 
of many different cell types, each of which can present 
some different expression patterns [8]. Here, we applied 
this technique to identify concordantly expressed bio- 
markers between different types of leukocytes. The 
COXEN algorithm compared expression patterns among 
candidate genes between the two sets to find concor- 
dantly expressed genes between them. Figure 2 is a 
schematic illustration of the COXEN algorithm by using 
an artificial 5-gene probe example [15]. In this figure, 
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Set 1 Co-expression Network Set 2 Co-expression Network 




: Positive Correlation 

: Negative Correlation 

Line thickness indicate strength 

of correlation 



In these two sets, probes 4 and 5 are 
similarly correlated to other probes. 
They are selected as "co-expressed" 
probes between sets 1 and 2. 



Figure 2 A schematic illustration of COXEN algorithm by using an artificial five-probe example. Probes 1 and 3 in cell set show 
essentially the same patterns as probes 1 and 3 in cell set 2. These two probes have the same co-expression correlation with other probes in 
these two cell sets. On the contrary, probes 2, 4, and 5 show different patterns of co-expression correlation in these two cell sets. Therefore, 
Probes 1 and 3 (but not 2, 4, and 5) will be selected by the COXEN algorithm. 



gene probes 1 and 3 in cell set 1 (e.g., the monocyte in 
this paper) essentially show the same patterns as probes 
1 and 3 in cell set 2 (e.g., the T-lymphocyte). That is to 
say, these two probes have the same co-expression cor- 
relation with other probes in these two cell sets. This 
co-expression correlation is analyzed by Spearman cor- 
relation. On the contrary, probes 2, 4, and 5 show differ- 
ent patterns of co-expression correlation in these two 
cell sets. In this example, gene probes 1 and 3 (but not 
2, 4, and 5) are selected by the COXEN algorithm. 
More technical details of the COXEN analysis can be 
found elsewhere [15,16]. By using the COXEN algorithm 
to the above monocyte- derived biomarkers (identified in 
Step 2), we identified three subsets of genes from these 
56 genes which showed concordant expression network 
patterns between monocyte and each of T-lymphocyte, 
macrophage, or whole WBC, respectively. We also 
tested the predictive performance of these COXEN 
genes in this paper. 



Results 

Identification of atherosclerosis-associated biomarkers 

The initial biomarker discovery step (Step 1) identified 
363 differentially expressed genes by comparing FH 
patients and normal controls in the training set FH1 
with q- value < 0.01 (Figure 1). The second function ana- 
lysis was performed by identifying genes of which biolo- 
gical functions might be closely related to 
atherosclerosis using the Ingenuity Pathway Analysis 
(IP A) software (Redwood City, CA). In particular, Genes 
related to inflammation, lipid metabolism process and 
another metabolic process and hematological system 
developments were selected. This functional analysis 
was necessary since differential expression of some of 
the initial biomarkers might not be directly relevant to 
the atherogenic disease mechanisms and only specific to 
the training patient set FH1 we used. This step resulted 
in 56 gene probes whose functional information is sum- 
marized in Table 2. 
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Table 2 Selected atherosclerosis related genes for model derivation. 
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8.40E-03 


-1.29 


MO/WBC/MA 


Carbohydrate Metabolism 


207319_s_at 


CDC2L5 


8.46E-05 


8.70E-03 


-1.59 


MO/MA 


Hematological System Development and Function 


207764_s_at 


HIPK3 


9.28E-05 


9.18E-03 


-1.57 


MO/MA 


Other Metabolic Processes 


209532_at 


PLAA 


9.57E-05 


9.33E-03 


-1.19 


MO 


Lipid Metabolism 
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Table 2 Selected atherosclerosis related genes for model derivation. (Continued) 



210333_at 


NR5A1 


9.76E-05 


9.43E-03 


1.37 


MO/WBC/MA 


Hematological System Development and Function 


205546_s_at 


TYK2 


1 .02E-04 


9.67E-03 


1.16 


MO 


Inflammatory Response 


20720 1_s_at 


SLC22A1 


1.10E-04 


9.97E-03 


1.33 


MO/MA 


Other Metabolic Processes 


213733_at 


MY01F 


1.11E-04 


1 .00E-02 


1.15 




Inflammatory Response 


214971_s_at 


ST6GAL1 


1.11E-04 


1 .00E-02 


-1.46 


MO/MA 


Inflammatory Response 



These genes were identified by t-test, biological function analysis and COXEN algorithm 
MO: Monocyte (Test setl); WBC: White blood cell (Test set2); MA: Macrophage (Test set3) 



Independent Evaluation of Atherosclerosis Risk Prediction 
for Test Sets 

Broad predictive information of these genes could be 
found in a standard hierarchical clustering analysis 
using the so-called McQuittys or WPGMA algorithm 
which is known to be robust with highly varying sizes of 
clusters by using the average distance between clusters 



weighted by uneven cluster sizes. We performed a clus- 
tering analysis of the 56 biomarkers on one test data- 
set— FH2— by standardizing (subtracted by its mean and 
divided by its standard deviation) each gene's expression 
values across all patient samples (Figure 3). This cluster- 
ing heatmap showed that all but two FH patients clus- 
tered into their respective groups. These 56 genes were 





>- Biological regulation 
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Figure 3 Clustering Analysis of 56 genetic biomarkers for test set FH2. Test set FH2 was standardized (subtracted by its mean and divided 
by its standard deviation) for each gene and a hierarchical clustering analysis was performed on this set. McQuitty's or WPGMA method, which 
uses the average distance between clusters weighted by uneven cluster sizes, was used. Three functional subclusters of the 56 biomarkers were 
identified by examining common biological functions of these gene subclusters by the function annotation tool in DAVID database (http://david. 
abcc.ncifcrf.gov). 
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Table 3 Prediction Results for three test sets. 





No. of genes for prediction 


T-test (P-value) 


Sensitivity (%) 


Specificity (%) 


PPV (%) 


NPV (%) 


Youden index 


56 gene model 


FH2 


56 


0.011 


70.0 


84.6 


77.8 


78.6 


0.55 


FH3 


56 


0.010 


80.0 


100 


100 


0.83 


0.8 


ATHR01 


56 


0.019 


71.4 


71.4 


71.4 


71.4 


0.43 


COXEN model 




48 


0.0024 


70.0 


92.3 


87.5 


80.0 


0.62 




6 


1.5e-10 


100 


100.0 


100.0 


100 


1.00 




25 


0.010 


78.6 


66.7 


68.8 


76.9 


0.45 



Data sets FH2, FH3, and ATHER01 were used to evaluate the classification performance of our atherosclerosis prediction model. We tested if the predicted risk 
of atherosclerosis could statistically discriminate familial hypercholesterolemia or general atherosclerosis patients with normal people in these test sets. The 
statistical significance of the difference in predicted scores (COXEN SCORE) was evaluated using the two-sample t-test between patients and normal groups. 
Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were also calculated. 



also found to be well clustered into three sub-functional 
categories: biological regulation, localization, and meta- 
bolic process. 

We independently evaluated the performance of our 
56-gene predictor of T-cells on the FH2 set (Table 3). To 
examine the distributional differences of the predicted 
scores among 3 homozygous FH patients, 7 heterozygous 
FH patients and 13 normal controls in FH2 data set, the 
predicted scores of the two groups were plotted (Figure 
4). We also compared predicted scores between the high- 
risk group (FH patients) and low-risk group (controls) by 
the Student's two-sample t-test. The predicted scores of 
the control (low-risk) group were significantly lower than 



those of the FH patient (high-risk) group (p = 0.01). 
Furthermore, the predicted scores of homozygous FH 
patients with a higher risk of atherosclerosis were found 
to be considerably higher than those of heterozygote FH 
patients who generally show a lower risk of atherosclero- 
sis. The high- and low-risk groups could be well distin- 
guished at the optimal cutoff value in the ROC analysis 
by maximizing the Youden index (with sensitivity = 
70.0% and specificity = 92.3%). Thus, the 56-gene 
COXEN predictor for T-lymphocytes significantly strati- 
fied the two different risk groups of atherosclerosis. 

In the next application, we used another independent 
patient dataset FH3 of total white blood cells to test the 



P-value=0.011 



P-value=0.0024 
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Figure 4 Prediction results of the data set FH2 using universal and COXEN predictors. Comparison of LDA probabilities of Familial 
Hypercholesterolemia (FH) patients and control people in data set FH2. Normal controls, homozygous and heterozygous FH patients are labeled 
with dots, empty and filled triangles respectively. The statistical significance (p-value) of the set of predictions was assessed by a two sample t- 
test. 
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predictability of the same 56-gene model for WBC. This 
small cohort consisted of 5 FH patient and 5 healthy 
controls. As shown in Figure 5, this 56-gene predictor 
was also able to significantly discriminate the two 
groups showing predicted scores of the FH patients sig- 
nificantly higher than those of the controls (two sample 
t-test p-value = 0.01, Table 3) with perfect classification 
specificity = 1 and sensitivity = 1 at the Youden cutoff 
value for this small set. 

We next applied the identical 56-gene predictor to the 
ATHEROl set of asymptomatic atherosclerosis patients. 
This set included 28 blood macrophage samples from 
14 patients confirmed with asymptotic atherosclerosis 
and 14 healthy controls, both groups with a family his- 
tory of coronary artery disease. In particular, 14 patients 
were diagnosed with atherosclerosis by CTA but they 
had not shown any clinical symptoms with little clinical 
difference from the control subjects with a family his- 
tory of CAD. Therefore, this stratification was challen- 
ging and we believe it could be similar to difficult cases 
of diagnosis in clinics. 

Using the prediction model of these 56 biomarkers on 
FH1, we tested this multi-gene predictor on macrophage 
cell data for predicting the risk of asymptomatic athero- 
sclerosis. As shown in Figure 6, the predicted scores of 
asymptomatic atherosclerosis patients were significantly 
higher than those of the controls (t-test p-value = 0.02, 
Table 3). This result was again quite encouraging indi- 
cating that our multi-gene model could provide the pre- 
dictability for asymptomatic atherosclerosis patients. 



COXEN applications 

While our universal 56-gene model showed consistent 
prediction performance on the three independent test 
sets, we further investigated whether these predictions 
could be improved by using biomarkers that preserved 
highly concordant expression patterns between different 
cell types of leukocytes. In particular, we used the 
COXEN algorithm originally developed to discover con- 
cordant expression biomarkers between cancer cell lines 
and human patients with cancer. Applying the COXEN 
algorithm, we identified 48, 6, and 25 genes (among 
these 56 genes) that preserved highly concordant gene 
expression patterns between patient sets FH1 and FH2, 
FH1 and FH3, or FH1 and ATHEROl, respectively, with 
a Pearson correlation association p-value < 0.001 (Table 
2). 

The number of COXEN genes generally reflects the 
similarities between two data sets. FH1 and FH2 were 
found to share the most concordantly expressed genes 
(48 genes among 56) since these two data sets were 
obtained from the same set of patients. For the other 
two data sets, the ATHEROl set from macrophages was 
found to result in more COXEN genes (25 genes among 
56) that were concordantly expressed with those of the 
FH1 monocyte set than the WBC FH3 set (6 among 
56). Thus, macrophage cells appeared to preserve many 
common gene expression patterns with monocyte cells. 
On the contrary, WBC, which is a mixture of different 
cell types, showed less consistent gene expression pat- 
terns with monocytes than the other two sets. As shown 
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Figure 5 Prediction results of the data set FH3 using universal and COXEN predictors. Comparison of LDA probabilities of FH patients and 


healthy controls in FH3. The statistical significance (p-value) of the set of predictions was assessed by a two sample t-test. 
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Figure 6 Prediction results of the data set ATHER01 using universal and COXEN predictors. Comparison of LDA probabilities of patient 
with subclinical atherosclerosis and a family history of coronary heart disease (CHD) and the age and sex-matched subjects with no 
atherosclerosis and no family history of CHD (control group). The statistical significance (p-value) of the set of predictions was assessed by a two 
sample t-test. 



in Table 3 these COXEN predictors provided even bet- 
ter prediction results. The three COXEN biomarker pre- 
dictors showed very significant t-test p-values for 
stratifying high risk patients from controls to 0.0024, 
1.5e-10, and 0.01, respectively. Thus, the COXEN pre- 
dictors showed considerable improvement from the uni- 
versal prediction model. 

Gene function analysis by the IPA software (Redwood 
City, CA) showed 6 of 48 biomarkers of T-lymphocyte 
cells were directly related to leukocyte activation and a 
change in morphology and behavior of a leukocyte 
including lipopolysaccharide binding protein (LBP), 
interleukin 27 receptor, alpha (IL27RA), mitogen-acti- 
vated protein kinase kinase kinase 7 (MAP3K7), pallidin 
homolog (PLDN), signal transducer and activator of 
transcription 6, interleukin-4 induced (STAT6) and C- 
type lectin domain family 7, member A (CLEC7A). This 
clearly implied that T-lymphocyte activation was 
involved in atherogenesis. 6 genes selected for white 
blood cell FH3 prediction were phospholipid scramblase 
3 (PLSCR3), nischarin (NISCH), bromodomain contain- 
ing 4 (BRD4), huntingtin (HTT), ribulose-5-phosphate- 
3-epimerase (RPE) and nuclear receptor subfamily 5, 
group A, member 1 (NR5A1). Also, the 6 biomarkers 
were the phosphoproteins of which phosphate groups 
regulate protein functions by esterifying to serine, threo- 
nine or tyrosine. In particular, NR5A1 is known to be a 
nuclear receptor that regulates the transcription of 



many genes involved in atherogenesis and PLSCR3 has 
been proved to be related to atherosclerosis develop- 
ment by animal studies [17]. 

Discussion 

In this study, we developed multi-gene biomarker mod- 
els to predict early-stage (asymptomatic) atherosclerosis 
based on different types of leukocytes. In particular, we 
have shown a proof-of-concept strategy to predict 
asymptomatic atherosclerosis by using molecular bio- 
markers from peripheral blood samples, and the exis- 
tence of common molecular expression signatures of 
atherogenic risks across different cell types of WBC. We 
believe these predictions were possible due to several 
reasons. First, gene expression signatures from patient 
blood samples appear to have a high potential to predict 
atherosclerosis in its early stage since certain molecular 
changes in blood cells may occur much before the pla- 
que development or serious clinical symptoms. Second, 
a genome-wide microarray technique enabled us to 
comprehensibly identify relevant molecular expression 
signatures of atherogenic risks beyond the information 
obtained from standard clinical parameters. We also 
believe our multi-gene predictors could predict the risk 
of atherosclerosis more accurately than predictors based 
on single biomarkers or a small number of clinical para- 
meters. Finally, we found many biomarkers of athero- 
sclerosis shared consistent expression patterns across 
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different leukocyte subsets. It will be interesting to 
further investigate these common biomarkers for their 
specific roles in the disease. 

We note that FH1 and FH2 were from the identical set 
of familial hyperglycemia patients and healthy controls, 
and possibly that our significant prediction on FH2 is 
correlated with the use of the identical patient set. How- 
ever, we believe that our significant prediction on FH2 is 
not mainly due to the use of the same patient set for the 
following reasons. First, the molecular data of the two 
sets were from completely different immune cells — FH1 
from monocytes and FH2 from circulating T cells. Since 
our biomarker discovery and predictive modeling were 
performed strictly based on monocyte cells of the FH1 
set, FH2 is independent of FH1 for its molecular charac- 
teristics and data. Second, we observed that our identical 
prediction model performed considerably better for 
white blood samples on FH3, a completely independent 
set of FH patients and controls from the FH1 set. We 
think this was due to the fact that monocytes are partially 
included in white blood cells so our monocyte-based pre- 
dictor presented a better predictability for that set. 
Therefore, the common molecular information appears 
to be more important than the use of specific patient set 
for our training and prediction. 

We believe our approach can be highly useful for clin- 
ical diagnosis of atherosclerosis for the following rea- 
sons. First, we used molecular signatures from patients' 
blood samples that can be conveniently obtained in rou- 
tine clinical practice. Second, we found that molecular 
signatures of atherogenic risks exist and are commonly 
shared among different types of leukocytes so we may 
be able to choose and further refine diagnosis tests 
based on one or multiple types depending on their accu- 
racy and clinical applicability. For example, clinical data 
showed that monocyte cells are involved in very early 
pathogenetic stages of atherosclerosis [18]. In particular, 
when atherosclerosis plaques (or foam cells) are differ- 
entiated from monocytes recruited from circulating 
blood, critical molecular changes appear to occur much 
before any clinical symptoms of atherosclerosis. Thus, a 
molecular test based on blood monocyte cells may serve 
as an effective diagnosis tool for an early stage of ather- 
osclerosis. Also, as seen in several patient data sets we 
investigated here, microarray profiling of a small amount 
of blood cells can now be efficiently and cost-effectively 
performed so its use becomes quite practical for clinical 
applications as well as scientific investigations. However, 
once final biomarkers and prediction models are identi- 
fied and finalized, diagnosis tests and assays can also be 
developed with more economic and convenient techni- 
ques such as RT-PCR. 

There are several limitations in our current study. 
First, our primary biomarker discovery and prediction 



model training were performed by contrasting familial 
hypercholesterolemia patients against healthy controls. 
Likely due to this restriction, our prediction was better 
in stratifying FH patients from healthy controls than 
general subclinical atherosclerosis patients. When we 
reversed the role of training and test sets in our preli- 
minary analysis, i.e. used the subclinical atherosclerosis 
patient set for model training and FH patient sets for 
independent model test, the prediction results were gen- 
erally deteriorated, possibly due to the small sample size 
of the subclinical atherosclerosis patient set (data not 
shown). Questions regarding whether predictors can 
perform significantly better if they are trained based on 
the same disease type of patients and/or same subtype 
of blood cells requires more careful investigation with a 
larger number of patient data in a future study. Also, we 
found that COXEN biomarker discovery and modeling 
training based on monocyte data was more successful to 
predict risks based on other cell types than other direc- 
tions, e.g. T-cell data for training to predict the others. 
It may be due to the data quality or biological informa- 
tion in the monocyte data which requires further 
investigation. 

Even though we did not use patients' outcome infor- 
mation in our COXEN-based predictions, we partially 
used the molecular information of our test patient data 
sets in the current study. A more rigorous prediction 
performance of these predictors should thus be further 
evaluated using a third patient set from the same cell 
type of leukocytes. Our current predictors were con- 
structed solely based on molecular data due to the lack 
of patients' other clinical information in our datasets. 
However, we believe additional predictive information 
can be obtained from many clinical parameters of 
patients such as age, gender [19], LDL level, HDL level, 
apolipoproteins or triglyceride level. Also one of the 
keys to enhancing the success rate by the prediction 
model in the future is that "six stages of atherosclerotic 
lesion" were used to construct the model, rather than 
only using "0" or "1" to represent the atherogenic risk of 
the atherosclerosis patients. If relevant clinical data are 
available for our model development, we believe predic- 
tion of atherogenic risks can be further improved by 
constructing models both with molecular and clinical 
parameters. 

Conclusion 

We discovered 56 atherosclerosis-related expression bio- 
markers which showed significant differential expression 
between genome-wide microarray data of monocytes 
from patients with familial hyperlipidemia and increased 
risk of atherosclerosis compared to normal controls. 
Based on these biomarkers, we developed our multi- 
gene biomarker model that could predict early-stage 
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(asymptomatic) atherosclerosis across different types of 
leukocytes. We also applied a COXEN algorithm to 
identify concordantly expressed biomarkers between 
monocytes and each of three different cell types of leu- 
kocytes — macrophage, circulating T, and whole white 
blood cells. We then developed three separate multi- 
gene predictors using the three subsets of these 56 bio- 
markers using the monocyte patient data as the training 
set. These individual predictors showed further 
improved statistical power for predicting early-stage 
atherosclerosis. We thus believe these predictors can be 
quite useful for developing diagnostic tools for patients 
who are at an increased risk of cardiovascular disease. 
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